Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


you will be able to use it from any directory. While you are in the “bwa” directory, run the “pwd” command to print the absolute path of BWA and copy it. Then change to your home directory and open the “.bashrc” file using “vim” or any text editor of your choice:

cd $HOME

vim .bashrc

Add the following to the end of the “.bashrc” file:

export PATH="your_path/bwa:$PATH"

Do not forget to replace “your_path” with the path to the “bwa” directory on your computer. Save the “.bashrc” file, exit, and restart the terminal (or run “source ~/.bashrc”) for the change to take effect. Type “bwa” in the terminal and press the Enter key. If BWA was installed and added to the path correctly, you will see the help screen.
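The check described above can also be scripted. The following is a minimal sketch, assuming a POSIX-compatible shell; the helper name “on_path” is hypothetical and chosen only for illustration.

```shell
# Report whether a command can be found on the current PATH.
# "on_path" is a hypothetical helper name, not part of BWA.
on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH"
        return 1
    fi
}

# Print a hint instead of failing when bwa cannot be located.
on_path bwa || echo "Check the export line in ~/.bashrc and restart the terminal."
```

If the second message appears, the most likely cause is a wrong directory in the “export” line or a terminal that was not restarted.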

BWA has three alignment algorithms: BWA-MEM (“bwa mem”), BWA-SW (“bwa bwasw”), and BWA-backtrack (“bwa aln/samse/sampe”). Both “bwa mem” and “bwa bwasw” can map short and long sequences produced by any of the sequencing technologies. BWA-backtrack (“bwa aln/samse/sampe”) is designed for Illumina short reads of up to 100 bp. Among the three algorithms, “bwa mem” is generally the most accurate and the fastest, and it is the one the BWA documentation recommends for high-quality reads.
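To make the difference between the three algorithms concrete, the sketch below prints, without running them, the single-end command lines each one would need. The function name “bwa_dry_run” and the file names “ref.fa”, “reads.fq”, “reads.sai”, and “aln.sam” are placeholders assumed for illustration; note that BWA-backtrack needs two steps, “bwa aln” followed by “bwa samse”.

```shell
# Print (do not run) the single-end command line(s) for one BWA algorithm.
# bwa_dry_run and all file names below are illustrative placeholders.
bwa_dry_run() {
    algo=$1
    ref=$2
    reads=$3
    case "$algo" in
        mem)
            echo "bwa mem $ref $reads > aln.sam" ;;
        bwasw)
            echo "bwa bwasw $ref $reads > aln.sam" ;;
        backtrack)
            # Two steps: find suffix-array coordinates, then write SAM.
            echo "bwa aln $ref $reads > reads.sai"
            echo "bwa samse $ref reads.sai $reads > aln.sam" ;;
        *)
            echo "unknown algorithm: $algo" >&2
            return 1 ;;
    esac
}

bwa_dry_run mem ref.fa reads.fq
bwa_dry_run backtrack ref.fa reads.fq
```

For paired-end reads, BWA-backtrack would use “bwa sampe” in place of “bwa samse”, while “bwa mem” simply takes the two FASTQ files as consecutive arguments.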

Aligning read sequences to a reference genome with BWA first requires indexing the reference genome with the “bwa index” command. We can use this command to index the human reference genome, which was downloaded and indexed with “samtools faidx” above, as follows:

bwa index GRCh38.p13_ref.fna

Indexing will take some time, depending on the size of the reference genome and the memory of your computer. When the “bwa index” command finishes, it displays information including the number of iterations, the elapsed time in seconds, the indexed FASTA file name, and the real time and CPU time taken by the indexing process. Indexing the human genome may take up to six hours on a desktop computer with 32 GB of RAM.
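Because a run of this length is easy to lose track of, it can help to keep the messages in a log file and record the wall-clock time yourself. The following is a minimal sketch under stated assumptions: “run_timed” is a hypothetical wrapper and “bwa_index.log” is an arbitrary log name. It is not a feature of BWA.

```shell
# Run a long command, save everything it prints to a log file,
# and report its exit status and wall-clock time in seconds.
# "run_timed" is a hypothetical helper name.
run_timed() {
    log=$1
    shift
    start=$(date +%s)
    status=0
    "$@" >"$log" 2>&1 || status=$?
    end=$(date +%s)
    echo "exit=$status elapsed=$((end - start))s log=$log"
    return $status
}

# Attempt the indexing only when bwa and the reference file are present.
if command -v bwa >/dev/null 2>&1 && [ -f GRCh38.p13_ref.fna ]; then
    run_timed bwa_index.log bwa index GRCh38.p13_ref.fna
fi
```

The summary that “bwa index” prints at the end of the run is then available in the log file for later reference.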

The BWA indexing process creates five index files with the extensions “.amb”, “.ann”, “.bwt”, “.pac”, and “.sa”. The total storage space for the current human reference genome and its index files is around 9.4 GB.
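To see how much space the reference and its index occupy on your own computer, the sizes of the files can be added up. This is a minimal sketch, assuming the index files sit next to the FASTA file and carry its name as a prefix; “index_bytes” is a hypothetical helper name.

```shell
# Sum the sizes, in bytes, of a reference FASTA file and whichever of
# its five BWA index files exist. "index_bytes" is a hypothetical name.
index_bytes() {
    ref=$1
    total=0
    for f in "$ref" "$ref.amb" "$ref.ann" "$ref.bwt" "$ref.pac" "$ref.sa"; do
        if [ -f "$f" ]; then
            size=$(wc -c < "$f")
            total=$((total + size))
        fi
    done
    echo "$total"
}

echo "$(index_bytes GRCh38.p13_ref.fna) bytes"
```

Files that do not exist are skipped, so the function prints 0 when it is run in a directory that does not contain the reference.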

The “.amb” file records the locations of the ambiguous (unknown) bases in the FASTA reference file, that is, bases written as N or as any character other than A, C, G, or T. The “.ann” file contains annotation information such as sequence IDs and chromosome numbers. The “.bwt” file is a binary file holding the Burrows–Wheeler transformed sequence. The “.pac” file is a binary file holding the packed reference sequence. The “.sa” file is also binary and contains the suffix array index. To map read sequences to the reference genome, all five files must be together in the same directory.
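Before starting an alignment, it is worth confirming that none of the five index files is missing. The following is a minimal sketch, assuming the default naming in which each index file is the FASTA file name plus its extension; “missing_bwa_index” is a hypothetical helper name.

```shell
# List any of the five BWA index files that are missing for a reference.
# Prints nothing when the index is complete.
# "missing_bwa_index" is a hypothetical helper name.
missing_bwa_index() {
    ref=$1
    for ext in amb ann bwt pac sa; do
        [ -f "$ref.$ext" ] || echo "$ref.$ext"
    done
}

missing=$(missing_bwa_index GRCh38.p13_ref.fna)
if [ -z "$missing" ]; then
    echo "index complete"
else
    echo "missing index files:"
    echo "$missing"
fi
```

If any file is reported as missing, rerun “bwa index” on the reference or copy the missing file into the same directory as the others.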